INTERSPEECH.2007 - Analysis and Assessment

Total: 58

#1 A conservative aggressive subspace tracker

Author: Koby Crammer

The need to track a subspace that describes a stream of points well arises in many signal processing applications. In this work, we present a very efficient algorithm based on a machine learning approach, whose goal is to de-noise the stream of input points. The algorithm guarantees the orthonormality of the representation it uses. We demonstrate the merits of our approach using simulations.

#2 Mutual information and the speech signal

Authors: Mattias Nilsson ; W. Bastiaan Kleijn

Mutual information is commonly used in speech processing in the context of statistical mapping. Examples are the optimization of speech or speaker recognition algorithms, the computation of performance bounds on such algorithms, and bandwidth extension of narrow-band speech signals. It is generally ignored that speech-signal-derived data usually have an intrinsic dimensionality that is lower than the dimensionality of the observation vectors (the dimensionality of the embedding space). In this paper, we show that such reduced dimensionality can affect the accuracy of the mutual information estimate significantly. We introduce a new method that removes the effects of singular probability density functions. The method does not require prior knowledge of the intrinsic dimensionality of the data. It is shown that the method is appropriate for speech-derived data.

#3 Spectro-temporal analysis of speech using 2-D Gabor filters

Authors: Tony Ezzat ; Jake Bouvrie ; Tomaso Poggio

We present a 2-D spectro-temporal Gabor filterbank based on the 2-D Fast Fourier Transform, and show how it may be used to analyze localized patches of a spectrogram. We argue that the 2-D Gabor filterbank has the capacity to decompose a patch into its underlying dominant spectro-temporal components, and we illustrate the response of our filterbank to different speech phenomena such as harmonicity, formants, vertical onsets/offsets, noise, and overlapping simultaneous speakers.
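As a toy illustration of the idea, the sketch below correlates a spectrogram patch with complex 2-D Gabor kernels via the 2-D FFT. The kernel parameterization (`omega_t`, `omega_f`, `sigma`) is illustrative, not the authors' filterbank design: a kernel tuned to spectral (frequency-axis) modulation responds strongly to a steady formant-like stripe, while a temporally tuned kernel does not.

```python
import numpy as np

def gabor_2d(shape, omega_t, omega_f, sigma=0.25):
    """Complex 2-D Gabor kernel: a Gaussian envelope times a plane wave.

    omega_t, omega_f are normalized temporal/spectral frequencies
    (cycles per sample along each axis); sigma sets the envelope width
    relative to the patch size.  Names here are illustrative only.
    """
    rows, cols = shape
    f = np.arange(rows)[:, None] - rows / 2   # frequency (channel) axis
    t = np.arange(cols)[None, :] - cols / 2   # time (frame) axis
    envelope = np.exp(-(f**2 + t**2) / (2 * (sigma * min(shape))**2))
    carrier = np.exp(2j * np.pi * (omega_f * f + omega_t * t))
    return envelope * carrier

def filter_patch(patch, kernel):
    """Filter a spectrogram patch with one Gabor kernel via the 2-D FFT
    (circular convolution) and return the magnitude response."""
    out = np.fft.ifft2(np.fft.fft2(patch) * np.fft.fft2(kernel, patch.shape))
    return np.abs(out)

# A toy patch with pure horizontal structure (a steady "formant" stripe):
patch = np.zeros((32, 32))
patch[10, :] = 1.0

# A kernel tuned to spectral modulation responds far more strongly to
# the stripe than one tuned to temporal modulation.
horiz = filter_patch(patch, gabor_2d((32, 32), omega_t=0.0, omega_f=0.2))
vert = filter_patch(patch, gabor_2d((32, 32), omega_t=0.2, omega_f=0.0))
print(horiz.max() > vert.max())  # True
```

Other spectro-temporal phenomena (harmonicity, onsets, noise) would likewise each excite a characteristic subset of such kernels.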

#4 A comparative study of speech rate estimation techniques

Authors: Tomas Dekens ; Mike Demol ; Werner Verhelst ; Piet Verhoeve

In this paper we evaluate the performance of 8 different speech rate estimators [1, 2, 3, 4, 5] previously described in the literature by applying them to a multilingual test database [6]. All the estimators show an underestimation at high speech rates, and some also suffer from an overestimation at low speech rates. Overall, the tested methods obtain high correlation coefficients with the reference speech rate. The Temporal Correlation and Selected Sub-band Correlation method (tcssbc), which uses sub-band and time-domain correlation to detect the number of vowels or diphthongs present in the speech signal, shows small errors and appears to be the most appropriate overall technique for speech rate estimation.

#5 Spectro-temporal processing for blind estimation of reverberation time and single-ended quality measurement of reverberant speech

Authors: Tiago H. Falk ; Hua Yuan ; Wai-Yip Chan

Auditory spectro-temporal representations of reverberant speech are investigated for blind estimation of reverberation time (RT) and for single-ended measurement of speech quality. The auditory representations are obtained from an eight-filter filterbank which is used to extract the modulation spectra from temporal envelopes of the speech signal. Gaussian mixture models (GMMs), one for each modulation channel and trained on clean speech signals, serve as reference models of normative speech behavior. Consistency measures, computed between reverberant test signals and each GMM, are mapped to an estimated RT and to an estimated quality score. Experiments show that the proposed measures achieve superior performance relative to current state-of-the-art algorithms.
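The modulation-spectrum feature at the heart of such methods can be sketched for a single band: take the Hilbert envelope of the band signal and examine the spectrum of that envelope. This is a minimal single-band sketch; the paper builds an eight-filter auditory filterbank and GMM reference models on top of features like these.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed via the FFT
    (for even-length real input)."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = h[n // 2] = 1
    h[1:n // 2] = 2
    return np.abs(np.fft.ifft(X * h))

def modulation_spectrum(envelope, fs):
    """DFT magnitude of the mean-removed temporal envelope, plus the
    modulation frequency (Hz) of each bin."""
    env = envelope - envelope.mean()
    spec = np.abs(np.fft.rfft(env))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)
    return freqs, spec

# Toy band signal: a 1 kHz carrier amplitude-modulated at 4 Hz
# (speech envelopes concentrate modulation energy around 2-8 Hz;
# reverberation smears this distribution).
fs = 8000
t = np.arange(fs) / fs  # 1 second
x = (1 + 0.8 * np.cos(2 * np.pi * 4 * t)) * np.cos(2 * np.pi * 1000 * t)
freqs, spec = modulation_spectrum(hilbert_envelope(x), fs)
print(freqs[np.argmax(spec)])  # 4.0 -- the modulation rate is recovered
```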

#6 Linear prediction of audio signals

Authors: Toon van Waterschoot ; Marc Moonen

Linear prediction (LP) is a valuable tool for speech analysis and coding, due to the efficiency of the autoregressive model for speech signals. In audio analysis and coding, the sinusoidal model is much more popular, which is partly due to the poor performance of audio LP. By examining audio LP from a spectral estimation point of view, we observe that the distribution of the audio signal's dominant frequencies in the Nyquist interval is a critical factor determining LP performance. In this framework, we describe five existing alternative LP methods and illustrate how they all attempt to solve the observed frequency distribution problem.
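For reference, conventional autoregressive LP amounts to solving the Toeplitz normal equations built from the signal's autocorrelation. The sketch below (plain Yule-Walker solution via numpy, not any of the five alternative methods discussed) recovers the coefficients of a synthetic AR(2) signal:

```python
import numpy as np

def lp_coefficients(x, order):
    """Autocorrelation-method linear prediction: solve the Toeplitz
    normal equations R a = r for the predictor coefficients a."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])
    # predictor: x[n] is modelled as sum_k a[k] * x[n-1-k]

# An AR(2) signal is matched almost perfectly by an order-2 model;
# the estimated coefficients approach the generating ones.
rng = np.random.default_rng(0)
e = rng.standard_normal(4096)
x = np.zeros(4096)
for n in range(2, 4096):
    x[n] = 1.3 * x[n - 1] - 0.8 * x[n - 2] + e[n]
a = lp_coefficients(x, 2)
print(np.round(a, 1))  # approximately [ 1.3 -0.8]
```

The spectral-estimation view in the abstract concerns how well the poles of such a model land on an audio signal's dominant frequencies across the Nyquist interval.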

#7 Stabilised weighted linear prediction - a robust all-pole method for speech processing

Authors: Carlo Magi ; Tom Bäckström ; Paavo Alku

Weighted linear prediction (WLP) is a method to compute all-pole models of speech by applying temporal weighting of the residual energy. By using short-time energy (STE) as a weighting function, the algorithm over-weights those samples that fit the underlying speech production model well. The current work introduces a modified WLP method, stabilised weighted linear prediction (SWLP), which always leads to stable all-pole models whose performance can be adjusted by changing the length (denoted by M) of the STE window. With a large M value, the SWLP spectra become similar to conventional LP spectra. A small value of M results in SWLP filters similar to those computed by the minimum variance distortionless response (MVDR) method. The study compares the performance of SWLP, MVDR, and conventional LP in spectral modelling of speech sounds corrupted by additive white Gaussian noise. Results indicate that SWLP is the most robust method against noise, especially with a small M value.
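The plain WLP criterion, weighted least squares with short-time-energy weights, can be sketched as follows. Variable names are illustrative, and the stabilisation step that distinguishes SWLP from WLP is omitted here:

```python
import numpy as np

def wlp(x, order, M):
    """Weighted linear prediction (covariance-style formulation).

    Each squared residual at time n is weighted by the short-time
    energy of the M preceding samples, so well-fitting high-energy
    regions dominate the fit.  A sketch of the plain WLP criterion
    only; the paper's SWLP variant adds a stabilising modification
    that guarantees a stable all-pole filter.
    """
    N = len(x)
    # STE weight for sample n: energy of x[n-M .. n-1] (plus a floor)
    w = np.array([np.sum(x[max(0, n - M):n] ** 2) for n in range(N)]) + 1e-9
    # Lagged-sample regressors Y and targets y for n >= order
    Y = np.column_stack([x[order - 1 - k:N - 1 - k] for k in range(order)])
    y = x[order:]
    sw = np.sqrt(w[order:])
    a_hat, *_ = np.linalg.lstsq(Y * sw[:, None], y * sw, rcond=None)
    return a_hat  # predictor: x[n] ~ sum_k a_hat[k] * x[n-1-k]

# Sanity check on a synthetic AR(2) signal: the weighted fit still
# recovers the generating coefficients.
rng = np.random.default_rng(1)
e = rng.standard_normal(4096)
x = np.zeros(4096)
for n in range(2, 4096):
    x[n] = 1.3 * x[n - 1] - 0.8 * x[n - 2] + 0.1 * e[n]
a_hat = wlp(x, order=2, M=16)
print(np.round(a_hat, 1))  # close to [ 1.3 -0.8]
```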

#8 Conditionally linear Gaussian models for estimating vocal tract resonances

Authors: Daniel Rudoy ; Daniel N. Spendley ; Patrick J. Wolfe

Vocal tract resonances play a central role in the perception and analysis of speech. Here we consider the canonical task of estimating such resonances from an observed acoustic waveform, and formulate it as a statistical model-based tracking problem. In this vein, Deng and colleagues recently showed that a robust linearization of the formant-to-cepstrum map enables the effective use of a Kalman filtering framework. We extend this model both to account for the uncertainty of speech presence by way of a censored likelihood formulation and to explicitly model formant cross-correlation via a vector autoregression, and in doing so retain a conditionally linear and Gaussian framework amenable to efficient estimation schemes. We provide evaluations using a recently introduced public database of formant trajectories, for which results indicate improvements of 20% to over 30% per formant in terms of root mean square error, relative to a contemporary benchmark formant analysis tool.

#9 Time-varying pre-emphasis and inverse filtering of speech

Authors: Karl Schnell ; Arild Lacroix

In this contribution, a time-varying linear prediction method is applied to speech processing. In contrast to the commonly used linear prediction approach, the proposed time-varying method considers the continuous time evolution of the vocal tract and, additionally, avoids block-wise processing. On the assumption that the linear predictor coefficients evolve linearly in sections and continuously over the whole signal, the optimum time-varying coefficients can be determined quasi-analytically by a least mean square approach. The investigations show that the method is well suited to realizing a time-varying pre-emphasis. Furthermore, the results show that the method is suitable for time-varying inverse filtering.

#10 Reconstructing audio signals from modified non-coherent Hilbert envelopes

Authors: Joachim Thiemann ; Peter Kabal

In this paper, we present a speech and audio analysis-synthesis method based on a Basilar Membrane (BM) model. The audio signal is represented in this method by the Hilbert envelopes of the responses of complex gammatone filters uniformly spaced on a critical-band scale. We show that for speech and audio signals, a perceptually equivalent signal can be reconstructed from the envelopes alone by an iterative procedure that estimates the associated carrier for the envelopes. The rate requirement of the envelope information is reduced by low-pass filtering and sampling, and it is shown that it is possible to recover a signal without audible distortion from the sampled envelopes. This may lead to improved perceptual coding methods.
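The iterative carrier-estimation idea can be illustrated for a single band: alternately impose the target Hilbert envelope on the current estimate and re-band-limit the result. This is a toy sketch under simplifying assumptions (one rectangular FFT band instead of the paper's gammatone filterbank, and no envelope sampling):

```python
import numpy as np

def analytic(x):
    """Analytic signal via the FFT (even-length real input)."""
    n = len(x)
    h = np.zeros(n)
    h[0] = h[n // 2] = 1
    h[1:n // 2] = 2
    return np.fft.ifft(np.fft.fft(x) * h)

def reconstruct_from_envelope(env, band, fs, iters=50):
    """Iteratively estimate a carrier consistent with a target Hilbert
    envelope inside one analysis band: each pass imposes the target
    envelope on the current analytic signal, then re-band-limits."""
    lo, hi = band
    n = len(env)
    freqs = np.abs(np.fft.fftfreq(n, 1 / fs))
    mask = (freqs >= lo) & (freqs <= hi)
    x = np.random.default_rng(0).standard_normal(n)  # arbitrary start
    for _ in range(iters):
        a = analytic(x)
        a = env * np.exp(1j * np.angle(a))               # impose envelope
        x = np.fft.ifft(np.fft.fft(a.real) * mask).real  # re-band-limit
    return x

# Target: the envelope of a known band-limited AM signal.
fs, n = 8000, 8000
t = np.arange(n) / fs
target = (1 + 0.5 * np.cos(2 * np.pi * 4 * t)) * np.cos(2 * np.pi * 1000 * t)
env = np.abs(analytic(target))
x = reconstruct_from_envelope(env, band=(900, 1100), fs=fs)
# err is small: the recovered carrier nearly reproduces the envelope.
err = np.linalg.norm(np.abs(analytic(x)) - env) / np.linalg.norm(env)
```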

#11 A flexible spectral modification method based on temporal decomposition and Gaussian mixture model

Authors: Binh Phu Nguyen ; Masato Akagi

This paper presents a new spectral modification method to solve two drawbacks of conventional spectral modification methods: insufficient smoothness of the modified spectra between frames, and ineffective spectral modification. To overcome the insufficient smoothness, a speech analysis technique called temporal decomposition (TD) is used to model the spectral evolution. Instead of modifying the speech spectra frame by frame, we only need to modify event targets and event functions, and the smoothness of the modified speech is ensured by the shape of the event functions. To overcome the ineffective spectral modification, we explore Gaussian mixture model (GMM) parameters as an input of TD to model the spectral envelope, and develop a new method of modifying GMM parameters in accordance with formant scaling factors. Experimental results verify the effectiveness of the proposed method in terms of the smoothness of the modified speech and the effectiveness of the spectral modification.

#12 A comparison of estimated and MAP-predicted formants and fundamental frequencies with a speech reconstruction application

Authors: Jonathan Darch ; Ben Milner

This work compares the accuracy of fundamental frequency and formant frequency estimation methods and maximum a posteriori (MAP) prediction from MFCC vectors with hand-corrected references. Five fundamental frequency estimation methods are compared to fundamental frequency prediction from MFCC vectors in both clean and noisy speech. Similarly, three formant frequency estimation and prediction methods are compared. An analysis of estimation and prediction accuracy shows that prediction from MFCCs provides the most accurate voicing classification across clean and noisy speech. On clean speech, fundamental frequency estimation outperforms prediction from MFCCs, but as noise increases the performance of prediction is significantly more robust than estimation. Formant frequency prediction is found to be more accurate than estimation in both clean and noisy speech. A subjective analysis of the estimation and prediction methods is also made by reconstructing speech from the acoustic features.

#13 Effect of incomplete glottal closures on estimates of glottal waves via inverse filtering of vowel sounds

Authors: Huiqun Deng ; Douglas O'Shaughnessy

Glottal waves obtained via inverse filtering vowel sounds may contain residual vocal-tract resonances due to incomplete glottal closures. This paper investigates the effect of incomplete glottal closures on the estimates of the glottal waves via inverse filtering. It shows that such a residual resonance appears as stationary ripples superimposed on the derivatives of the original glottal wave over a whole glottal cycle. Knowing this, one can determine if there are significant resonances of vocal tracts in the obtained glottal waves. It also shows that, given an incomplete glottal closure, better estimates of glottal waves can be obtained from large lip-opening vowel sounds than from other sounds. The glottal waves obtained from /ɑ/ produced by male and female subjects are presented. The obtained glottal waves during rapid vocal-fold collisions exhibit transient positive derivatives, which are explained by the air squeezed by the colliding vocal folds and the air from the glottal chink.

#14 Vocal tract and area function estimation with both lip and glottal losses

Authors: Kaustubh Kalgaonkar ; Mark A. Clements

Traditional algorithms simplify the lattice recursion for evaluation of the PARCOR coefficients by localizing the loss in the vocal tract at one of its ends, the lips or the glottis. In this paper we present a framework for mapping the VT transfer function to pseudo areas with no rigid constraints on the losses in the system, thereby allowing losses to be present at both the lips and the glottis. This method allows us to calculate the reflection coefficients at both the glottis (rG) and the lips (rLip).

#15 Detection of instants of glottal closure using characteristics of excitation source

Authors: S Guruprasad ; B Yegnanarayana ; K Sri Rama Murty

In this paper, we propose a method for detection of glottal closure instants (GCI) in the voiced regions of speech signals. The method is based on periodicity of significant excitations of the vocal tract system. The key idea is the computation of coherent covariance sequence, which overcomes the effect of dynamic range of the excitation source signal, while preserving the locations of significant excitations. The Hilbert envelope of linear prediction residual is used as an estimate of the source of excitation of the vocal tract system. Performance of the proposed method is evaluated in terms of the deviation between true GCIs and hypothesized GCIs, using clean speech and degraded speech signals. The signal-to-noise ratio (SNR) of speech signals in the vicinity of GCIs has significant bearing on the performance of the proposed method. The proposed method is accurate and robust for detection of GCIs, even in the presence of degradations.

#16 A comparative evaluation of the zeros of z transform representation for voice source estimation

Authors: Nicolas Sturmel ; Christophe D'Alessandro ; Boris Doval

A new method for voice source estimation is evaluated and compared to Linear Prediction (LP) inverse filtering methods (autocorrelation LPC, covariance LPC and IAIF [1]). The method is based on a causal/anticausal model of the voice source and the ZZT (Zeros of Z-Transform) representation [2] for causal/anticausal signal separation. A database containing synthetic speech with various voice source settings and natural speech with acoustic and electro-glottographic signals was recorded. Formal evaluation of source estimation is based on spectral distances. The results show that the ZZT causal/anticausal decomposition method outperforms LP in voice source estimation both for synthetic and natural signals. However, its computational load is much heavier (despite a very simple principle) and the method seems sensitive to noise and computation precision errors.

#17 Women's vocal aging: a longitudinal approach

Author: Markus Brückl

A quasi-experimental longitudinal paired-samples study was carried out to explore whether aging by 5 years can (1) audibly and (2) measurably change women's vocalisations, and if so, (3) which acoustic information the listeners' performance could rely on and (4) which parameters can contribute to detecting the chronological difference.

#18 Effect of intensive voice therapy on vocal tremor for Parkinson speakers

Authors: Laurence Cnockaert ; Jean Schoentgen ; Canan Ozsancak ; Pascal Auzou ; Francis Grenez

The effect of intensive voice therapy (Lee Silverman Voice Treatment, LSVT) on vocal tremor features of Parkinson speakers is presented. Vocal tremor is the low-frequency variation of the vocal frequency. Its features differ between Parkinson and normophonic speakers. Here, vocal tremor features have been estimated for a corpus of speakers with Parkinson's disease, recorded before and after intensive voice therapy. Results show that the treatment has significant effects on vocal tremor amplitude: vocal tremor amplitude decreased right after treatment. After six months it increased again, but remained lower than before treatment.

#19 Assessment of vocal dysperiodicities in connected disordered speech

Authors: A. Alpan ; A. Kacha ; Francis Grenez ; Jean Schoentgen

The aim of the presentation is to investigate acoustic analysis of connected speech by means of an average-equalized and energy-equalized variogram to extract vocal dysperiodicities. The variogram enables positioning a current and a lagged analysis frame in adjacent speech cycles to track inter-cycle dysperiodicities. Average and energy equalization of the analysis frames are options that make it possible to compensate for slow deterministic changes of the speech signal amplitude in connected speech. The instantaneous dysperiodicity trace has been summarized by means of segmental and global signal-to-dysperiodicity ratios. Results show that signal-to-dysperiodicity ratios obtained by variogram analysis correlate strongly with the perceived degree of hoarseness when the analysis frames are energy-equalized. Equalizing the frame averages removes small artifacts in the instantaneous dysperiodicity trace that are caused by sound-to-sound transients or intrusive low-frequency noise.

#20 Effects of FE modelled consequences of tonsillectomy on perceptual evaluation of voice

Authors: Anne-Maria Laukkanen ; Jaromír Horáček ; Pavel Švancara ; Elina Lehtinen

This study aimed to investigate the effects of a tonsillectomy on the perceived overall voice quality and timbre. Computer simulations of five Czech vowels were made, including both the calculated resonance effects of large tonsils (size 1.6 cm³) and the resonances without tonsils. The simulations were made using a finite element model of the vocal tract, based on magnetic resonance images. The size and shape of the tonsils were ascertained from clinical data. The generated pressure outputs were transformed into sound records presented to 10 trained listeners. Formant frequencies of the simulated samples were measured. The samples with and without tonsils did not differ significantly from each other in voice quality. F3 was significantly lower and the timbre was darker without tonsils. Thus, the effects of tonsillectomy on voice may be perceptible, at least in the case of large tonsils. The effect, however, may disappear in time due to changes in the tissue and due to compensatory changes in articulation.

#21 Speech quality after major surgery of the oral cavity and oropharynx with microvascular soft tissue reconstruction

Authors: Irma M. Verdonck-de Leeuw ; Louis ten Bosch ; Li Ying Chao ; Rico N. P. M. Rinkel ; Pepijn A. Borggreven ; Lou Boves ; C. René Leemans

Speech quality of patients with oral or oropharyngeal carcinoma was assessed by perceptual and acoustic-phonetic analyses. Speech recordings of running speech of patients before and 6 and 12 months after treatment for oral or oropharyngeal cancer and of 18 control speakers were evaluated regarding intelligibility, nasality and articulation, which revealed deteriorated speech in 20% of the patients before treatment, and in 75% 6-12 months after treatment. Acoustic analyses comprised formant, duration, perturbation and noise measures of the vowels /i/, /a/, and /u/ and were performed on the speech samples 6 months after treatment and the controls. Patients appeared to have a smaller vowel space compared to controls, which was clearly related to speech intelligibility. Furthermore, voice perturbation appeared to be higher in patients. Although oropharyngeal treatment does not affect the function of the larynx itself, the acoustic coupling between source and filter may affect the smoothness of the voicing characteristics. The presented speech analyses may serve as part of an outcome measurement protocol for assessing efficacy of speech rehabilitation.

#22 Voice fatigue and use of speech recognition: a study of voice quality ratings

Authors: Christel de Bruijn ; Sandra Whiteside

Previous studies have suggested that the use of speech recognition software may be related to the development of voice problems. The aim of this study is to investigate the effects of using such software on perceptual voice quality. In particular, the type of speech recognition (discrete vs. continuous) and the vocal load of the speaker are considered as variables. One of the most consistent results was a rise in pitch, a common finding in voice fatigue studies. It is interpreted as part of a hyperfunctional mechanism countering early signs of voice fatigue.

#23 Complementary approaches for voice disorder assessment

Authors: Jean-François Bonastre ; Corinne Fredouille ; A. Ghio ; A. Giovanni ; G. Pouchoulin ; J. Révis ; B. Teston ; P. Yu

This paper describes two comparative studies of voice quality assessment based on complementary approaches. The first study was undertaken on 449 speakers (including 391 dysphonic patients) whose voice quality was evaluated in parallel by a perceptual judgment and objective measurements on acoustic and aerodynamic data. Results showed that a nonlinear combination of 7 parameters allowed the classification of 82% of voice samples into the same grade as the jury. The second study relates to the adaptation of Automatic Speaker Recognition (ASR) techniques to pathological voice assessment. The system designed for this particular task relies on a GMM-based approach, which is the state of the art for ASR. Experiments conducted on 80 female voices provide promising results, underlining the interest of such an approach. We benefit from the multiplicity of these techniques to evaluate the methodological situation, which points to fundamental differences between these complementary approaches (bottom-up vs. top-down, global vs. analytic). We also discuss some theoretical aspects of the relationship between acoustic measurement and perceptual mechanisms, which are often forgotten in the performance race.

#24 Frequency study for the characterization of the dysphonic voices

Authors: G. Pouchoulin ; Corinne Fredouille ; Jean-François Bonastre ; A. Ghio ; A. Giovanni

Concerned with pathological voice assessment, this paper aims at characterizing dysphonia in the frequency domain for a better understanding of the related phenomena, whereas most studies have focused only on improving classification systems for diagnostic aid purposes. In this context, a GMM-based automatic classification system is applied to different frequency ranges in order to investigate which ones are relevant for dysphonia characterization. Experimental results demonstrate that the low frequencies ([0–3000] Hz) are more relevant for dysphonia discrimination than the higher frequencies.

#25 Acoustic correlates of laryngeal-muscle fatigue: findings for a phonometric prevention of acquired voice pathologies

Author: Victor J. Boucher

This presentation focuses on the problem of defining valid acoustic correlates of vocal fatigue, seen as a physiological condition that can lead to voice pathologies. Several findings are reported based on a corpus of recordings involving electromyography (EMG) of laryngeal muscles and voice acoustics. The recordings were obtained in sessions of vocal effort extending across 12-14 hours. A known technique for estimating muscle fatigue, involving "spectral compression" of EMG potentials, is applied. The results show critical changes at given times of day. In examining the effects of these changes on voice acoustics, there is no linear correlation with respect to conventional acoustic parameters, but peaks in voice tremor occur at points of critical change in muscle fatigue. Further results are presented showing the need to take into account compensatory muscle actions in defining phonometric signs of vocal fatigue.